ICEBERG REPLICATION
CLOUDERA


What is APACHE ICEBERG?


- Bootstrapped at Netflix and then contributed to the open source community
- The open table format for analytics datasets
- Works with several compute engines: Spark, Trino, PrestoDB, Flink, Hive, and Impala


Why Replication? Why not just copy?


Why replication?
- Natural disasters
- Ensure business continuity
- Identifying critical data
- Evaluate & Tune


Why not just copy?
- Schema evolution
- Hidden partitioning
- Time travel


What is SCHEMA EVOLUTION? 


Challenge: Update a table schema 


Folder-level operations 


Resource intensive 


Slow and inefficient


File-level


Track all files, metadata and data
Atomic update to new metadata file


[Diagram: two metadata files]
Lockless readers
Isolated writers


Metastore/Catalog no longer a bottleneck 
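
A minimal sketch of the "atomic update to new metadata file" idea, assuming a toy in-memory catalog. `Catalog` and `commit` are hypothetical names for illustration, not Iceberg's API: a writer prepares a new metadata file on the side, then swaps the catalog pointer only if nobody else committed first.

```python
import threading


class Catalog:
    """Toy catalog: maps a table name to the path of its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        # Readers only look up the pointer; they never wait for a writer's files.
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Move the pointer from `expected` to `new`; return False on conflict."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit(catalog, table, base, new_metadata_path):
    """Publish a schema change written against `base`; the caller retries on False."""
    return catalog.swap(table, base, new_metadata_path)
```

A writer whose `commit` returns False would re-read `current`, rebuild its metadata file on top of it, and try again.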


What is PARTITIONING?


Challenge: Maintaining partitions is a lot of work 


Hidden Partitioning 


No need for specific partition filter 
Several partitioning options 
Agnostic to a physical layout 


Evolution 


Operation at the metadata level
Split Planning


[Diagram: Partitioned_by_Month and Partitioned_by_Hour]
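
Hidden partitioning can be pictured as a function from a column value to a partition value that lives in table metadata rather than in the query. The sketch below is illustrative only: `month_transform`, `hour_transform`, and `prune` are hypothetical helpers, and counting from the Unix epoch is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_transform(ts):
    """Months since 1970-01 (assumed encoding for this sketch)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def hour_transform(ts):
    """Whole hours since the Unix epoch."""
    return int((ts - EPOCH).total_seconds() // 3600)


def prune(partitions, transform, start, end):
    """Keep only partition values that can hold rows with start <= ts <= end."""
    lo, hi = transform(start), transform(end)
    return [p for p in partitions if lo <= p <= hi]


# The query filters on the timestamp column only; the partition filter is derived.
ts = datetime(2023, 6, 18, 10, 22, tzinfo=timezone.utc)
monthly = prune([640, 641, 642], month_transform, ts, ts + timedelta(hours=1))
```

Switching a table from the month function to the hour function changes only which function new writes use, which is why it can be a metadata-level operation.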
What is TIME TRAVEL?


Challenge: Wish we could go back in time to fix a mistake


Structure

Metadata Files
- Schema & partition information, snapshot details

Manifest List File
- Information about all Manifest files
- Groups all the manifest files part of the snapshot

Manifest file: List of data files
Data files: map the data on the actual disk


[Diagram: Catalog, two Metadata files, two Manifest Lists, three Manifest files, data files]
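
The layered structure above can be sketched as plain data classes. This is a simplified model for illustration; the class and field names are assumptions of this example and do not mirror the on-disk Iceberg format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifest:
    path: str
    data_files: List[str]  # paths of data files on the actual disk


@dataclass
class ManifestList:
    path: str
    manifests: List[Manifest]  # all manifest files that are part of one snapshot


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifest_list: ManifestList


@dataclass
class TableMetadata:
    schema: dict
    partition_spec: list
    snapshots: List[Snapshot] = field(default_factory=list)


def data_files_of(snapshot):
    """Walk manifest list -> manifests -> data files, keeping first-seen order."""
    seen, files = set(), []
    for manifest in snapshot.manifest_list.manifests:
        for path in manifest.data_files:
            if path not in seen:
                seen.add(path)
                files.append(path)
    return files
```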
TIME TRAVEL


Applications 


Time based OR 
snapshot ID based 
Rollback 
Projections 
Experiment 
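
A small sketch of "time based OR snapshot ID based" selection, assuming the snapshot history is available as a list of (commit time, snapshot ID) pairs sorted by time. `resolve_snapshot` is a hypothetical helper, not an Iceberg call.

```python
from bisect import bisect_right


def resolve_snapshot(history, as_of_ms=None, snapshot_id=None):
    """Pick a snapshot either by point in time or by explicit ID.

    history: list of (commit_time_ms, snapshot_id), sorted by commit time.
    """
    if (as_of_ms is None) == (snapshot_id is None):
        raise ValueError("pass exactly one of as_of_ms or snapshot_id")
    if snapshot_id is not None:
        if snapshot_id not in {sid for _, sid in history}:
            raise KeyError(snapshot_id)
        return snapshot_id
    # Time based: the latest snapshot committed at or before as_of_ms.
    times = [t for t, _ in history]
    i = bisect_right(times, as_of_ms)
    if i == 0:
        raise ValueError("no snapshot committed at or before that time")
    return history[i - 1][1]
```

A rollback then amounts to making the resolved snapshot the current one again.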


TIME TRAVEL!


REPLICATION MANAGEMENT SYSTEM 


Management Plane, Control Plane and Data Plane 101 


Management Plane
- The management plane decides how replication policies are managed and monitored, and how to provide useful metrics.


Control Plane
- The control plane is responsible for replication policy execution between the source and the destination.


Data Plane
- The data plane is responsible for the actual movement of data and how it scales as the data set scales.


Iceberg Replication Policy Definition 


Create Iceberg Replication Policy


General, Resources


Policy Name: Policy1
Source: Iceberg Replication (Cluster 1 @ source)
Destination: Iceberg Replication (Cluster 1)
Schedule: Immediate
Inclusion Table Filters: Database Name, Table Name/Regex
Exclusion Table Filters: Database Name, Table Name/Regex
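
One way the inclusion and exclusion table filters could be evaluated, as a sketch. The semantics here are assumptions of this example (exact database name match, full regex match on the table name, exclusion wins over inclusion); the slides do not spell them out, and `select_tables` is a hypothetical name.

```python
import re


def select_tables(tables, include, exclude):
    """Return the (database, table) pairs that a policy would replicate.

    tables:  iterable of (database, table) pairs found in the catalog
    include: list of (database, table_regex) inclusion filters
    exclude: list of (database, table_regex) exclusion filters
    """
    def matches(db, table, filters):
        return any(db == f_db and re.fullmatch(f_regex, table)
                   for f_db, f_regex in filters)

    return [(db, table) for db, table in tables
            if matches(db, table, include) and not matches(db, table, exclude)]


catalog_tables = [("sales", "orders"), ("sales", "orders_tmp"), ("hr", "staff")]
selected = select_tables(
    catalog_tables,
    include=[("sales", r"orders.*")],
    exclude=[("sales", r".*_tmp")],
)
```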
REPLICATION FLOW IN A NUTSHELL


Export, transfer, then synchronize 


SOURCE: Plan the replication by comparing versions
DESTINATION: Copy and transform files, then update the metastore


[Diagram: Metastore, Metadata Files, and Data Files on both source and destination; Checkpoint and File Lists in between; step 2. Transfer]


Executing a Replication Policy 


Steps involved in executing an Iceberg Replication Policy 


Source 


Destination 


Export Steps 


1. Reads the Policy Definition from the RDBMS
2. Evaluates the table filters
3. Gets the latest status for each table (checkpoint file)
4. Provides the table list and checkpoint file to the Source


Export Steps


1. Launches exportCLI with args: table list, checkpoint file
2. Reads table definition from the Catalog (Hive metastore)
3. Generates result, export.json
4. Provides result file export.json to the Destination
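
The export step plans the replication by comparing what the source has now with what the checkpoint says was already replicated. The sketch below shows that comparison as a set difference. The layout of export.json is not given in the slides, so the keys `table`, `copy`, and `delete` and the function `plan_export` are hypothetical.

```python
import json


def plan_export(table, source_files, checkpoint_files):
    """Compare the source file list with the checkpoint and plan the work."""
    source, replicated = set(source_files), set(checkpoint_files)
    return {
        "table": table,
        "copy": sorted(source - replicated),    # new on the source
        "delete": sorted(replicated - source),  # gone from the source
    }


plan = plan_export(
    "db.t",
    source_files=["m1.avro", "m2.avro", "d1.parquet"],
    checkpoint_files=["m1.avro", "old.parquet"],
)
export_json = json.dumps(plan, indent=2)
```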


Xfer Steps


1. Launches xferCLI with args: export.json
2. Launches multiple Distributed Copy (distcp) jobs
3. Reads data and metadata files from source
4. Writes data files to final path, writes metadata files to staging path


Sync Steps


1. Launches syncCLI
2. Updates metadata files and copies to the final path
3. For each table, updates Catalog (HMS) as a last step
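
The ordering of the xfer and sync steps can be sketched with dictionaries standing in for file systems and for the catalog. Everything here is a toy under stated assumptions: `xfer` and `sync` are hypothetical functions, file contents are strings, and "updating" a metadata file is modeled as rewriting a source path prefix to a destination prefix.

```python
def xfer(files, src_fs, dst_fs, final_root, staging_root):
    """Copy data files to the final path and metadata files to the staging path.

    files: list of (kind, relative_path), where kind is "data" or "metadata".
    """
    for kind, rel in files:
        root = final_root if kind == "data" else staging_root
        dst_fs[f"{root}/{rel}"] = src_fs[rel]


def sync(table, metadata_files, dst_fs, final_root, staging_root,
         src_prefix, dst_prefix, catalog):
    """Rewrite staged metadata, move it to the final path, update the catalog last.

    metadata_files: relative paths, with the table's current metadata file last.
    """
    for rel in metadata_files:
        staged = dst_fs.pop(f"{staging_root}/{rel}")
        dst_fs[f"{final_root}/{rel}"] = staged.replace(src_prefix, dst_prefix)
    # Readers on the destination see the new version only after this assignment.
    catalog[table] = f"{final_root}/{metadata_files[-1]}"
```

Keeping metadata in staging until sync means a reader on the destination never follows a pointer to a half-copied table.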


MANAGEMENT PLANE 


A unified plane for the policy management, monitoring, replication history and metrics 


EXECUTION MONITORING
Check replication status, historical runs


Monitoring
- Current Execution Status: export, xfer, sync, done
- Result: Success or Failure
- Runtime of each step


Centralized monitoring for all policies
- History of previous runs
- Spot the outlier


Replication Policies: Replication History


Name: Victory
Type: ICEBERG
Source: iceberg replication (Cluster 1 @ Source)
Destination: Iceberg Replication (Cluster 1)
Next Run: June 19, 2023 10:17 AM


Start Time | Duration | Outcome | Number of tables processed | Number of files copied | Number of files deleted | Number of manifests transformed
June 18, 2023 10:22 AM | 2 min | Successful | 0 | 5 | 0 | 0
June 18, 2023 10:19 AM | 2 min | Successful | 0 | 5 | 0 | 0
June 18, 2023 10:17 AM | 1 min | Successful | 0 | 5 | 0 | 0


Policy Stats
- Data, Metadata file replicated
- Manifest files transformed


Troubleshooting & Diagnostics
- Collect Diagnostics from Source
- Collect Diagnostics from Destination


DATA PLANE 


Teddy Choi 


ICEBERG REPLICATION AS FURNITURE SHOPPING


Checkout, deliver, and assemble the tables 


[Diagram: steps 1. Export and 3. Assemble; each side shows Manifest List, Manifest, Data]
REPLICATION PROCESS


Export, transfer, then synchronize


SOURCE: Plan the replication by comparing versions
DESTINATION: Copy and transform files, then update the metastore


[Diagram: Metastore, Metadata Files (Manifest List, Manifest), and Data Files on both sides; File Lists in between]
Q: WHY DON'T YOU SIMPLY...


A: Because of limitations of file systems that we should support 


List files recursively
- Simple and fast on HDFS, Ozone
- Slow and expensive on S3
- Files may not follow dir layout (migrated external tables)


Read all metadata files
- Simple and fast for small tables
- Can grow to millions
- HDFS namenode bottleneck


DAYS: DEPTH-FIRST-SEARCH
O(snapshots²) = O(6²) = 58 file reads, time: 58


[Diagram: source and destination file graphs with Metadata, Manifest list, and Manifest levels; legend: Destination exclusive, Source exclusive, Common; chart axes: Time, Optimization]


BOTTLENECKS 


There can be a lot of hidden bottlenecks 




LET'S BREAK BOTTLENECKS! 
And I'm good at breaking them! 
BOTTLENECK 1: REPEATED READS


It's slow because of repeated reads on the same files, which can be deduplicated


Depth-first-search
- Simple and fast for tree traversal
- It causes repeated reads in graphs that allow overlapping subgraphs


BFS with deduplication
- BFS enables deduplication
- A dedup takes O(1) with a hash table
- For millions of files, it takes GBs of memory
HOURS: BREADTH-FIRST-SEARCH WITH DEDUP
O(snapshots) = O(6) = 18 file reads
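
The difference between the two traversals can be reproduced on a toy file graph. The graph shape below is an assumption of this sketch (each snapshot adds one manifest and keeps all older ones), so the read counts differ from the 58 and 18 on the slides; only the quadratic versus linear growth is the point. `toy_table`, `dfs_reads`, and `bfs_dedup_reads` are hypothetical names.

```python
from collections import deque


def toy_table(n):
    """File graph of a table with n snapshots, as node -> referenced files."""
    graph = {"metadata": [f"list-{i}" for i in range(1, n + 1)]}
    for i in range(1, n + 1):
        graph[f"list-{i}"] = [f"manifest-{j}" for j in range(1, i + 1)]
        graph[f"manifest-{i}"] = []
    return graph


def dfs_reads(graph, node):
    """Plain depth-first search: a shared manifest is read once per reference."""
    return 1 + sum(dfs_reads(graph, child) for child in graph[node])


def bfs_dedup_reads(graph, root):
    """Breadth-first search with a hash set: every file is read exactly once."""
    seen, queue, reads = {root}, deque([root]), 0
    while queue:
        node = queue.popleft()
        reads += 1
        for child in graph[node]:
            if child not in seen:  # O(1) membership check
                seen.add(child)
                queue.append(child)
    return reads


graph = toy_table(6)
```

The `seen` set is the memory cost mentioned above: it holds one entry per file, which adds up for tables with millions of files.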


BOTTLENECK 2: UNNECESSARY READS 


It's slow because it reads all files, even though a typical replication run covers only a fraction of the data


Common snapshots
- Should be skipped for efficiency
- Iceberg subgraphs can overlap
- Some changes are referenced by common subgraphs


Optimistic concurrency + UUID
- "the failed writer retries by writing a new metadata tree based on the new current table state"
- We just need to additionally read the start and end of common snapshots


MINUTES: SKIPPING BREADTH-FIRST-SEARCH
11 file reads
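
A simplified sketch of the skipping idea: because files are immutable, anything the destination already holds does not need to be read again. This version skips every already-replicated node outright, which is a simplification of reading "the start and end of common snapshots"; the toy graph and the name `skipping_bfs` are assumptions of this example.

```python
from collections import deque


def toy_table(n):
    """File graph of a table with n snapshots, as node -> referenced files."""
    graph = {"metadata": [f"list-{i}" for i in range(1, n + 1)]}
    for i in range(1, n + 1):
        graph[f"list-{i}"] = [f"manifest-{j}" for j in range(1, i + 1)]
        graph[f"manifest-{i}"] = []
    return graph


def skipping_bfs(graph, root, already_replicated):
    """BFS with dedup that never reads files the destination already has."""
    seen, queue, read = {root}, deque([root]), []
    while queue:
        node = queue.popleft()
        read.append(node)
        for child in graph[node]:
            if child in seen or child in already_replicated:
                continue
            seen.add(child)
            queue.append(child)
    return read


# Snapshots 1 to 4 were replicated earlier; only 5 and 6 are new.
common = {f"list-{i}" for i in range(1, 5)} | {f"manifest-{i}" for i in range(1, 5)}
reads = skipping_bfs(toy_table(6), "metadata", common)
```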


BOTTLENECK 3: TOO MANY SMALL FILES 


It's slow because of the too-many-small-files syndrome on the HDFS namenode


Too many small files syndrome
- Small files still take a couple of seconds for namenode and datanode connection establishment


Multi-threaded
- Tables are independent from each other
- Easy to make it multi-threaded per table


SECONDS: MULTI-THREADED SKIPPING BFS
11 file reads, 4 threads, time: 3
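
Since tables are independent of each other, per-table work can be handed to a thread pool so that one table's connection latency overlaps with another's. A minimal sketch using Python's standard library; `export_all` and the `export_one` callback are hypothetical names, and the default of 4 workers simply echoes the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def export_all(tables, export_one, max_workers=4):
    """Run export_one(table) for every table on a pool of worker threads.

    tables: list of table names; export_one: function doing the I/O-bound work.
    Returns a dict mapping each table to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs tables with their results.
        return dict(zip(tables, pool.map(export_one, tables)))
```

Threads help here because the work is dominated by waiting on I/O rather than by computation.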


BOTTLENECK 4: FILE READS 


It's slower than file listing because it reads metadata files, which can be cached


Still need to read many files
- HDFS or Ozone file listing is faster when all files follow the directory layout


Iceberg file cache
- All Iceberg files are immutable
- Metadata files are small
- Cached to reduce file reads
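
Because Iceberg files are immutable, a cache keyed by path never needs invalidation. A minimal sketch, assuming a caller-supplied `read_file` function; `FileCache` is a hypothetical class, and a real cache would also bound its memory use.

```python
class FileCache:
    """Cache of immutable file contents keyed by path."""

    def __init__(self, read_file):
        self._read_file = read_file  # function: path -> contents
        self._cache = {}
        self.misses = 0              # number of real file reads performed

    def read(self, path):
        if path not in self._cache:
            self.misses += 1
            self._cache[path] = self._read_file(path)
        return self._cache[path]
```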


SUMMARY OF OPTIMIZATIONS 
Optimized from O(snapshots²) to O(changed ÷ threads)


LINEAR SCALABILITY 


Iceberg replication is linearly scalable 


SCALABILITY OF EACH PROCESS


The transfer process is the slowest, so it distributes its workload.


- Export: multi-threaded to reduce IO wait.
- Transfer: distributed, scale-out.
- Sync: multi-threaded to reduce IO wait.


[Chart labels: Data size; Fixed start time; More nodes = more throughput]


ENTERPRISE READY 


Optimized processes with all formats, versions, and Hadoop environments 


ALL DATA FILE FORMATS
- Apache Parquet: popular columnar format for Apache Spark and others
- Apache ORC: popular columnar format for Apache Hive and others
- Apache Avro: data serialization format for Apache Hadoop ecosystem


ALL TABLE SPEC VERSIONS
- Iceberg table spec V1: basic operations
- Iceberg table spec V2: row-level updates


ALL HADOOP CONFIGS
- NameNode High Availability: a hot standby for HDFS namenode
- Kerberos: strong authentication
- TLS: data in transit encryption


ROADMAP 


More storages, computes, use cases, and scale 


STORAGE: More vendors
- Apache HDFS: supported
- Apache Ozone: an open source object store.
- Amazon S3: a popular object store.
- More to come


COMPUTE: More vendors
- Private to private: supported
- Public to public
- Public to private
- Private to public


USE CASE: Richer use cases
- Failover: a standby cluster will take over a failed active cluster automatically.
- Migrated external table: doesn't follow the default Iceberg directory structure.
- Statistics: for query optimizers, with HMS and Puffin.


SCALE: Larger scale
- Micro batch: more responsive replication
- Fully distributed: for more linear scalability